House Sales in King County - LIME explanations

Author: Piotr Grabysz

Date: 26.03.2021

The goal of this notebook is to learn LIME (Local Interpretable Model-agnostic Explanations) techniques.

I use the House Sales in King County dataset and try to predict a house's price based upon the following features:

Data loading and preparation

I make a train/test split. Because the range of prices is very wide, I do a stratified split to ensure that the ratio of cheap and expensive samples is similar in both datasets.
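Stratifying on a continuous target can be done by binning prices into quantiles first and stratifying on the bins. A minimal sketch with synthetic data (the column names, bin count and split ratio here are illustrative, not necessarily the notebook's actual choices):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the King County data: right-skewed prices.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft_living": rng.integers(400, 8000, size=1000),
    "price": np.exp(rng.normal(13, 0.5, size=1000)),  # roughly log-normal
})

# Bin prices into quantiles and stratify the split on those bins, so cheap
# and expensive houses appear in similar proportions in both sets.
price_bins = pd.qcut(df["price"], q=10, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="price"), df["price"],
    test_size=0.2, stratify=price_bins, random_state=42,
)
print(len(X_train), len(X_test))
```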

Training the model - XGBoost

I train an XGBoost model with parameters I found to perform well in the previous homework.

In the previous notebook about SHAP I chose three different observations: one with a low price, one with a price around the average, and one with a very high price. I use the same observations in this notebook:

For the first, cheap example, my prediction is very inaccurate, but it is decent for the remaining two exemplary observations.

LIME

LimeTabularExplainer needs the names of all features and the indices of the categorical features.

1) Low price example

One should be careful when analyzing the explanation for this particular observation, because my XGBoost heavily overestimates the price (from 80,000\$ to 150,118\$), and the local linear model overestimates it even further, to 379,818\$! This might be because there is little data with prices that low, while a lot of the data has prices 10 or even 100 times greater (so it might be a better idea to train the model on the logarithm of house prices).
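The log-price idea can be sketched as follows. This uses sklearn's GradientBoostingRegressor on synthetic skewed prices just to show the transform; the notebook itself uses XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic right-skewed prices, mimicking the wide price range.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
price = np.exp(13 + X[:, 0])

# Fit on log(price) so errors are relative rather than absolute,
# then invert the transform when predicting.
model = GradientBoostingRegressor(random_state=0).fit(X, np.log(price))
pred_price = np.exp(model.predict(X))
```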

While keeping this in mind, we see that the factor with the greatest positive impact on the price is latitude. As I showed in the previous notebook, more expensive houses are located in the central area, along the river:

As can be seen, this house lies at a latitude similar to the expensive houses; however, it is located in a completely different area.

The lack of a waterfront view lowers the price significantly. That is understandable, but I am surprised that it plays the biggest role, much bigger than, for example, small square footage. Personally, I would put more weight on square footage and grade than on access to a waterfront view.

grade, sqft_living (square footage of the living space) and view contribute negatively to the price. This is plausible, since their values are low in comparison to other houses.

yr_built < 1951 (the first quartile) has a small positive contribution. It is hard for me to say why without knowing Seattle's history and architecture. I plotted the price distribution conditioned on yr_built, but based upon it there is no evidence that yr_built < 1951 supports higher prices.
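The conditioning I describe can be sketched as a groupby on the first-quartile cutoff (synthetic data with illustrative column names, not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: build years and prices drawn independently.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "yr_built": rng.integers(1900, 2016, size=1000),
    "price": np.exp(rng.normal(13, 0.5, size=1000)),
})

# Compare price distributions on either side of the first-quartile year.
cutoff = df["yr_built"].quantile(0.25)
summary = df.groupby(df["yr_built"] < cutoff)["price"].describe()
print(summary[["mean", "50%"]])
```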

2) Average price example

For the average price example, the prediction of my model is pretty accurate (real: 540,000\$, predicted: 539,440\$) and the local model's prediction is also not far from the truth: 511,320\$.

The biggest negative impact is made by waterfront=0. As I mentioned in the previous example, it is clear that not having a waterfront view is worse than having one, but it is quite concerning that this is the most important factor.

The features contributing positively to the price are the square footage of the 15 nearest neighbors, the square footage above ground level and the year built.

It is surprising to me that the square footage of the house's actual living space has a much lower impact on the prediction than the square footage of the neighbors. Maybe one of these two variables is somehow redundant?

We can see on the plot above that these variables are correlated, but there is a lot of variance as well, so I wouldn't call them redundant.
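The correlated-but-noisy relationship can be checked numerically. A sketch with synthetic footage data constructed to behave this way (the real notebook would compute this on the actual columns):

```python
import numpy as np

# Synthetic stand-ins: neighbours' footage correlated with, but not
# determined by, the house's own footage.
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 800, size=1000)
sqft_living15 = 0.6 * sqft_living + rng.normal(800, 500, size=1000)

r = np.corrcoef(sqft_living, sqft_living15)[0, 1]
print(round(r, 2))  # correlated but well below 1, so not redundant
```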

3) High price example

For this example my model is very accurate: it predicted 6,908,374\$ while the real price is 6,885,000\$. The local model performed terribly, predicting 2,827,749\$, less than half of the real price. This might be because this expensive example is an outlier:

This house is the third most expensive in the dataset, so maybe there wasn't much data supporting such high prices and the local model couldn't learn them.

Still, the predicted 2,827,749\$ is a lot, and almost all the features contribute positively to reaching such a price. The most important ones are the highest possible grade and view, as well as the large living space and the exceptional number of bedrooms.

As in the two previous examples, the lack of a waterfront view lowers the price. Well, for an estate of this quality such a thing might be an issue.

Training a second model - neural network

Low price example

Neural network explanation (note that all the printed values differ from those in the dataset, because the features were scaled before training the network):
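The scaling step can be sketched with StandardScaler (synthetic two-feature data; the notebook's actual preprocessing pipeline is not shown here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, e.g. sqft_living and grade.
rng = np.random.default_rng(0)
X = rng.normal(loc=[2000.0, 7.0], scale=[800.0, 1.2], size=(500, 2))

# Standardise before training the network; LIME then reports thresholds
# in these scaled units rather than the original ones.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```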

Let's recall the explanation of the XGBoost model:

We can see that the explanations are similar, with waterfront and lat being the most important features.

Average price example

Neural network explanation:

XGBoost explanation:

These explanations look similar. No waterfront is the most important factor. Longitude, view, condition and sqft_15 are important for both models. The only major difference is that sqft_above is the second most important feature for the neural network, whereas it has almost no importance for XGBoost.

High price example

Neural network explanation:

XGBoost explanation:

The above explanations are very similar.

Some final thoughts

One interesting thing I've noticed is the persistently important role of waterfront. The lack of a waterfront view is the most important factor in 4 out of 6 cases, and in all of them it is one of the most important factors lowering the price.

I have two questions:

1) Is the waterfront variable just as important when waterfront == 1?

Let's choose some houses with a waterfront:
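Selecting such houses is a simple boolean filter; a sketch on a toy frame (the notebook filters the full King County DataFrame):

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({
    "price": [300_000, 1_200_000, 450_000, 2_000_000],
    "waterfront": [0, 1, 0, 1],
})

# Keep only the rows where waterfront == 1.
waterfront_houses = df[df["waterfront"] == 1]
print(len(waterfront_houses))
```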

For both chosen examples waterfront is again the most important feature.

2) Is this role of waterfront visible in the price distribution?

There are almost no samples in the dataset with waterfront == 1. I can't understand why my models give this feature such a big role.
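The rarity can be quantified with `value_counts`. A sketch that simulates a waterfront share of roughly 0.75% (an assumed figure for the real dataset, used here only to generate comparable synthetic data):

```python
import numpy as np
import pandas as pd

# Simulate a dataset of ~21,600 houses where waterfront == 1 is very rare.
rng = np.random.default_rng(0)
waterfront = rng.choice([0, 1], size=21_613, p=[0.9925, 0.0075])

share = pd.Series(waterfront).value_counts(normalize=True)
print(share.round(4))
```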

My last comment is about the plausibility of the explanations. For the selected samples, the explanations for the XGBoost and neural network models were similar. I believe this might be evidence that these models are stable. But if their explanations differed, I could not decide which explanation is more plausible without extra expert knowledge.